Abstract

This report provides an overview of the findings and software that have evolved from the "Use of Minimal Lexical
Conceptual Structures for Single-Document Summarization" project over the last six months. We present the major
goals that have been achieved and discuss some of the open issues that we intend to address in the near future. This
report also contains details on the usage of some of the software that has been implemented during the project.
1 PARTNER ORGANIZATIONS THAT HAVE PROVIDED RESOURCES OR COLLABORATED ON RESEARCH

Richard  Schwartz,  BBN  Technologies 

2  COLLABORATIONS  (BROADLY  CONCEIVED) 

Presentations  by  Bonnie  Dorr  to  Georgetown  on  the  use  of  linguistic  information  in  hybrid  statistical/symbolic 
tasks  (summarization,  machine  translation,  divergence  unraveling). 

3  PROJECT  FINDINGS 

1.  We  have  shown  the  effectiveness  of  combining  sentence  compression  and  topic  lists  to  construct  informative 
summaries. 

2. We carried out experiments in which three approaches to automatic headline generation (Topiary, Trimmer and
Unsupervised Topic Discovery) were compared using two automatic summarization evaluation tools (BLEU and
ROUGE).

3. We have stressed the importance of correlating automatic evaluations with human performance of an extrinsic
task, and have proposed event tracking as an appropriate task for this purpose.

4. Bonnie Dorr and her student David Zajic (in collaboration with Rich Schwartz at BBN) competed in the Document
Understanding Conference (DUC), a summarization evaluation conducted by NIST. Their headline generator,
Topiary, was evaluated automatically using a new metric called ROUGE. The Topiary system placed first (out of 40
systems) in the headline (up to 75 characters) summarization task.

5. In the area of single-document (monolingual) summarization, Topiary placed first (out of 40 systems) on 3 ROUGE
measures and was the only system on this task to score better than a human summary on one measure. On the
single-document (cross-lingual) track, Dorr's team placed 2nd on 4 ROUGE measures.

6. A preliminary user study, in which users judged the relevance of a document given either the full document or
only its headline, shows that using headlines leads to similar precision and recall but reduces the time it takes
to assess the documents by a factor of 4.
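
For reference, precision and recall in a relevance-judgment study such as the one in item 6 can be computed as in the minimal Python sketch below. The document identifiers and the two judgment sets are invented for illustration and are not data from the study; the function name precision_recall is likewise hypothetical.

```python
from typing import Set, Tuple


def precision_recall(judged_relevant: Set[str], truly_relevant: Set[str]) -> Tuple[float, float]:
    """Compare a user's relevance judgments against reference judgments."""
    true_positives = len(judged_relevant & truly_relevant)
    # Precision: share of the documents the user accepted that are really relevant.
    precision = true_positives / len(judged_relevant) if judged_relevant else 0.0
    # Recall: share of the really relevant documents that the user accepted.
    recall = true_positives / len(truly_relevant) if truly_relevant else 0.0
    return precision, recall


# Toy data (assumption): judgments made from headlines only vs. reference judgments.
from_headlines = {"d1", "d2", "d3"}
reference = {"d2", "d3", "d4", "d5"}
print(precision_recall(from_headlines, reference))
```

Running the same function on judgments made from full documents and from headlines gives the two precision/recall pairs that such a study compares.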

4  OPPORTUNITIES  FOR  TRAINING  AND  DEVELOPMENT  (AT  ALL  GRADE  LEVELS) 

The automatically generated headlines allow users to assess the relevance of a document in a time-efficient way.

In addition, cross-lingual headline summarization allows a user who does not understand the language in which the
original document was authored to assess quickly whether it is relevant enough to be translated by a human translator.

5  PUBLICATIONS  AND  PRODUCTS 

5.1 JOURNAL/CONFERENCE PUBLICATIONS

D. Zajic, B. J. Dorr, and R. Schwartz. BBN/UMD at DUC-2004: Topiary. In Proceedings of the North American
Chapter of the Association for Computational Linguistics Workshop on Document Understanding, Boston, MA,
2004.

Dorr, Bonnie J., David Zajic, and Richard Schwartz, "Hedge: A Parse-and-Trim Approach to Headline Generation",
Proceedings of the HLT-NAACL Text Summarization Workshop and Document Understanding Conference
(DUC 2003), Edmonton, Canada, pp. 1-8, 2003.

Dorr, Bonnie J., Daqing He, Jun Luo, Douglas W. Oard, Richard Schwartz, Jianqiang Wang, and David Zajic,
"iCLEF 2003 at Maryland: Translation Selection and Document Selection", Proceedings of the Interactive track
for the Cross-Language Evaluation Forum Workshop, Trondheim, Norway, 2003.

Nizar Habash and Bonnie Dorr, "CatVar: A Database of Categorial Variations for English", in Proceedings of the
MT Summit, New Orleans, LA, pp. 471-474, 2003.

Nizar Habash and Bonnie Dorr, "A Categorial Variation Database for English", Proceedings of North American
Association for Computational Linguistics, Edmonton, Canada, pp. 96-102, 2003.

Nizar Habash. "Matador: A large scale Spanish-English GHMT system". In Proceedings of the MT-Summit,
pages 149-156, 2003.

Habash, Nizar, Bonnie J. Dorr, and David Traum, "Hybrid Natural Language Generation from Lexical Conceptual
Structures", Machine Translation, 18:2, 2003.

5.2  ONE-TIME  PUBLICATIONS  (INCLUDES  BOOK  CHAPTERS  AND  DISSERTATIONS) 

David Zajic. Automatic Generation of Informative Cross-Lingual Headlines for Text and Speech. Thesis Proposal,
University of Maryland, 2003.

5.3  OTHER  PRODUCTS 

1. Trimmer: Trimmer generates a headline for a news story by compressing the main topic sentence according to a
linguistically motivated algorithm. The compression consists of parsing the sentence using the BBN SIFT parser
and removing low-content syntactic constituents. Some constituents, such as certain determiners (the, a) and time
expressions, are always removed because they rarely occur in human-generated headlines and are low-content in
comparison to other constituents.

2. Topiary: Topiary is a modification of the Trimmer algorithm that takes a list of topics with relevance scores as
additional input. The compression threshold is lowered so that there will be room for the highest scoring topic
term that isn't already in the headline.

3. Generation Heavy Machine Translation system (GHMT). Currently, GHMT supports Spanish to English and Chinese
to English translation. In this project, GHMT is adapted in a way that allows cross-lingual summarization.
Online demo of the Spanish-English GHMT system: http://clipdemos.umiacs.umd.edu/matador/
Download: http://clipdemos.umiacs.umd.edu/ghmt/GHMT-PAK.tar.gz

Installation:

gunzip GHMT-PAK.tar.gz
tar -xf GHMT-PAK.tar

Documentation:

GHMT: GHMT-PAK/GHMT/install.readme


4. depTrimmer. A cross-lingual headline generation extension for GHMT. depTrimmer is fully integrated into
GHMT, where translation and sentence compression are applied in tandem. The benefit is that the summarization
algorithm is applied to a language-independent data structure, which makes it easy to adapt to a new foreign
language. This approach (depTrimmer) is currently implemented as a prototype for Spanish-English GHMT,
and no experimental results are available yet. depTrimmer works on the same data structures that are used within
GHMT, viz. normalized dependency trees. The dependency trees are 'trimmed' based on linguistic information
including part of speech, syntactic function, and semantic type. depTrimmer requires the GHMT package (see
above). Download: http://clipdemos.umiacs.umd.edu/deptrimmer/DEPTRIM-PAK.tar.gz

Installation:

gunzip DEPTRIM-PAK.tar.gz
tar -xf DEPTRIM-PAK.tar

Documentation:

depTrimmer: DEPTRIM-PAK/install.readme

5. CatVar: A Categorial Variation Database for English. CatVar is an extensive database of morphological
variation for English. CatVar is integrated into GHMT in order to increase the flexibility of the generation
of the English translation. Online demo: http://clipdemos.umiacs.umd.edu/catvar/
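
The way Topiary (item 2 above) makes room for a topic term can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the stand-in compressor truncate_trim (which simply drops trailing words and is not Trimmer's linguistic compression), the toy topic scores, and the choice to prepend the topic term to the headline.

```python
from typing import Callable, List, Tuple


def truncate_trim(words: List[str], limit: int) -> List[str]:
    """Stand-in compressor (assumption): drop trailing words until the text fits."""
    kept = list(words)
    while kept and len(" ".join(kept)) > limit:
        kept.pop()
    return kept


def topiary_headline(
    words: List[str],
    topics: List[Tuple[str, float]],
    limit: int,
    trim: Callable[[List[str], int], List[str]] = truncate_trim,
) -> str:
    """Combine a compressed sentence with the best topic term not already in it."""
    headline = trim(words, limit)
    present = {w.lower() for w in headline}
    candidates = [(score, term) for term, score in topics if term.lower() not in present]
    if not candidates:
        return " ".join(headline)
    _, topic = max(candidates)
    # Lower the compression threshold so the topic term and one space still fit.
    headline = trim(words, limit - len(topic) - 1)
    return " ".join([topic] + headline)


sentence = "rebels agree to talks with government officials".split()
topics = [("talks", 0.9), ("Congo", 0.8), ("peace", 0.5)]  # toy scores, not real output
print(topiary_headline(sentence, topics, limit=30))
```

In this toy run the top-scoring term "talks" is skipped because it already occurs in the compressed sentence, so the next term is used and the sentence is compressed a little further to stay within the limit.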

6  CONTRIBUTIONS 

6.1 CONTRIBUTIONS WITHIN THE DISCIPLINE

1. An extensive database of morphological variation for English.

2. Robust Machine Translation system from Spanish to English and Chinese to English.

3. A suite of automatic summarization tools (mono-lingual and cross-lingual).


6.2 CONTRIBUTIONS TO OTHER DISCIPLINES (THIS IS NOT EXPECTED FROM ALL PROJECTS)

The project is relevant to the augmentation of capabilities useful for intelligence analysts, such as cross-lingual
summarization and data mining.

6.3 CONTRIBUTIONS TO RESOURCES FOR RESEARCH

This  work  provides  an  integral  part  for  many  NLP  applications  that  require  cross-lingual  information  processing. 

6.4  CONTRIBUTIONS  BEYOND  SCIENCE  AND  ENGINEERING  (THESE  CAN  BE  SPECULATIVE) 

The research carried out in this project contributes to the development of cross-lingual information management and
processing systems, which help laypeople and professionals access information that is authored in a language
they do not understand.

7 PLANS FOR THE NEXT YEAR, IF CHANGED

We intend to continue the integration of depTrimmer into Chinese-English and Arabic-English GHMT. Additionally, we
plan to evaluate it on the DUC 2004 data sets.

Our funding for this project ends in early 2005. We will need additional funds for the 2 years after the project has expired
to continue the high level of activity that we have contributed to this effort over the last year. For this, we have one
proposal currently under review:

"Divergence Resolution for Interlingual Variation Encoding (DRIVE)", REFLEX submission. Broad Agency
Announcement (BAA-04-01-FH), May 2004.

8 SPECIAL REPORTING REQUIREMENTS, IF ANY

None. 
9 UNOBLIGATED FUNDS (ONLY IF OVER 20%)


N/A 

10  SIGNIFICANT  CHANGE  IN  USE  OF  HUMAN  SUBJECTS 

None. 

A  GHMT 

REQUIRED  RESOURCES 

1. LISP: International Allegro CL Enterprise Edition 6.0 (Franz Inc.)

2. Perl: v5.8.0

3. Connexor parser (English and Spanish) (from www.connexor.com). See the instructions below on hooking up the
Connexor client to the rest of the system.

4. Nitrogen Morphology Support

Nitrogen is available at: http://www.isi.edu/natural-language/projects/nitrogen/

Specifically, the morphology files nitro.english.morph.lisp, nitro.morph.8.98.lisp, and nitromorph-8-98.lisp must be
placed under $PACKAGE/EXERGE/SOURCE/oxyexerge/

5. Halogen Forest Ranker

Halogen is available at: http://www.isi.edu/licensed-sw/halogen/ All code from the forest ranker should be
installed under $PACKAGE/HALOGEN/ForestRanker

Make sure the variables in sysVars.cshrc are added to your .cshrc

The source files for the Exerge system are included in this package in addition to created images on Solaris. To remake
these images, run $PACKAGE/ake-Exerge.sh

See a sample run of Matador below.

CONNEXOR SPECIFIC INSTRUCTIONS

1. Contact www.connexor.com to obtain a license for English and Spanish parsers.

2.  Update  the  host/port  in  the  files  fdges-client.pl  (for  Spanish)  and  fdgen-client.pl  (for  English).  The  current  values 
should  look  like  this  for  fdges-client.pl: 

$remote_host="cheesecake.umiacs.umd.edu"

$remote_port="11720"

and as follows for fdgen-client.pl:

$remote_host="cheesecake.umiacs.umd.edu"

$remote_port="11721"

SAMPLE RUN

> matador.pl test out2 x params=params.matador.2
parameter params = params.matador.2 ... loading
Processing Batch #0
PARSING...
TRANSLATING...
reading /fs/clip-plus/habash/PACKAGE/TRANSLEX/Spanish-English/span-eng.tralex ... done
translating torotemp.dep ...
done
CONVERTING, EXPANDING...
/fs/clip-plus/habash/PACKAGE/EXERGE/corexerge.sh torotemp.trans.amr torotemp.out.amr
T NIL T T T 10 10 10 10 NIL I T NIL T
<Running CorExerge>
; Exiting Lisp
LINEARIZAING...
<Running OxyGen 2.0>
; Exiting Lisp
RANKING...
/fs/clip-plus/habash/PACKAGE/EXERGE/halogenize torotemp.out.gls torotemp.out.txt 6
/fs/clip-plus/habash/PACKAGE/HALOGEN/ForestRanker/news.binlm
<GLS-to-Forest Conversion> && <Running HALOGEN>
/fs/clip-plus/habash/HALOGEN/ForestRanker/polishsen.pl
/fs/clip-plus/habash/PACKAGE/MATADOR/halolin-temp.senO
> /fs/clip-plus/habash/PACKAGE/MATADOR/halolin-temp.sen
; cpu time (non-gc) 420 msec user, 10 msec system
; cpu time (gc) 70 msec user, 0 msec system
; cpu time (total) 490 msec user, 10 msec system
; real time 23,139 msec
; space allocation:
; 332,622 cons cells, 7,882,064 other bytes, 0 static bytes
; Exiting Lisp
REPORTING...
done!

B  DepTrimmer 

REQUIRED  RESOURCES 

1. GHMT System (specifically the Matador installation)

2. The source files for DepTrimmer are included in this package in addition to created images on Solaris. To remake
these images, go to $DEPTRIM-PAK/DEPTRIMMER/SOURCE and run make.

See a sample run of depTrimmer below.

depTrimmer  takes  an  AMR  tree,  removes  parts  of  the  sentence  until  the  sentence  length  is  below  some  threshold,  then 
outputs  the  trimmed  AMR  tree. 

The  trimming  algorithm: 

1. Delete all determiners

2. Delete all punctuation

3. Delete all time expressions

4. Delete some conjunctions

5. Delete some relative clauses

6. If the sentence is too long, delete all conjunctions

7. If the sentence is too long, delete all relative clauses

8. While the sentence is too long, delete prepositional phrases which do not contain a proper noun

9. While the sentence is too long, delete all prepositional phrases

10. Clean up any dangling connectives

Note that steps 1-5 take place *even if the sentence is already below the threshold.*
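
The split between unconditional and length-dependent steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the sentence is a flat list of (word, tag) pairs rather than an AMR tree, length is counted in tokens, "too long" means more tokens than the threshold, and the 'N', 'V' and 'ADV' tags and all function names are hypothetical. Only the 'D', 'PX' and 'TIME' tags come from the step descriptions in this appendix.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, tag)
Step = Callable[[List[Token]], List[Token]]


def drop_tag(tag: str) -> Step:
    """Build a step that deletes every token carrying the given tag."""
    def step(tokens: List[Token]) -> List[Token]:
        return [tok for tok in tokens if tok[1] != tag]
    return step


def trim(tokens: List[Token], threshold: int, always: List[Step], if_too_long: List[Step]) -> List[Token]:
    """Apply the unconditional steps, then further steps only while still too long."""
    for step in always:
        tokens = step(tokens)  # runs even if the sentence is already short enough
    for step in if_too_long:
        if len(tokens) <= threshold:
            break  # short enough: skip the remaining, more aggressive steps
        tokens = step(tokens)
    return tokens


sentence = [("the", "D"), ("president", "N"), ("spoke", "V"), ("today", "TIME"), (".", "PX")]
always = [drop_tag("D"), drop_tag("PX"), drop_tag("TIME")]
print(trim(sentence, threshold=10, always=always, if_too_long=[]))
```

The five-token sentence is already under the threshold of 10, yet the determiner, the time expression and the punctuation are still removed, which is the behavior the note above describes.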

More  detailed  explanation  of  steps: 

1. Delete all determiners. Determiners aren't generally needed for comprehension. Most real headlines don't have
them. We delete anything tagged as 'D' (determiner).

2. Delete all punctuation. Punctuation isn't very important for comprehension. The sophisticated use of punctuation
found in real headlines is quite difficult to reproduce. We delete anything tagged as 'PX' (punctuation).

3. Delete all time expressions. Time expressions are generally superfluous. Relative expressions like "today" are
meaningless after that day has passed. When specific dates appear, they generally include the event that takes
place on that date, which is more useful to keep. E.g. in "the November elections", 'November' is not as important
as 'elections'. There may be specific time expressions which should be excluded from this, e.g. "the September
11th investigation committee". We delete anything tagged as 'TIME' (time expressions).

4. Delete some conjunctions. In phrases like "he ran away and hid his face" or "the President and the Vice President",
the subordinate phrase is generally less important. If any phrase (noun, verb, or prepositional) is connected to a
phrase of the same type by a conjunction, the subordinate phrase is deleted. Optionally, we delete only phrases
which do not contain a proper noun.

5. Delete some relative clauses. In phrases like "actions which would bring about changes", the head of the sub-phrase
"which would bring about changes" is the verb "bring". Verb phrases which are direct children of noun phrases
are generally less important, and we delete them. Optionally, we delete only phrases which do not contain a proper
noun.

6. If the sentence is too long, delete all conjunctions. In step 4, we had the option of leaving phrases containing a
proper noun intact. If we did so, and the sentence is too long, we now delete all these phrases.

7. If the sentence is too long, delete all relative clauses. In step 5, we had the option of leaving phrases containing a
proper noun intact. If we did so, and the sentence is too long, we now delete all these phrases.

8. While the sentence is too long, delete prepositional phrases which do not contain a proper noun. We assume that
the deepest prepositional phrase is the least important. E.g. in "The prince of the smallest country in the world",
"in the world" is probably the least important part of the phrase. Therefore, while the sentence is too long, we find
the deepest prepositional phrase which does not contain a proper noun, and delete it.

9. While the sentence is too long, delete all prepositional phrases. If the sentence is still too long, we repeat step 8,
except that a phrase is deleted regardless of whether or not it contains a proper noun.

10. Clean up any dangling connectives. The deletion process leaves some connectives dangling. Here, we delete any
connective which doesn't have siblings.
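
Steps 8 and 9 can be sketched in Python as below. This is a minimal illustration rather than depTrimmer's implementation: the Node class is a hypothetical toy phrase tree (not the AMR or dependency format used by GHMT), the tags 'PP' and 'PROPN' are assumed names, length is counted in words, and "too long" is taken to mean more words than the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str                      # e.g. "NP", "PP", "N", "P", "PROPN" (assumed tag names)
    word: str = ""                # empty for internal nodes
    children: List["Node"] = field(default_factory=list)


def words(node: Node) -> List[str]:
    """Collect the words of the subtree in left-to-right order."""
    out = [node.word] if node.word else []
    for child in node.children:
        out.extend(words(child))
    return out


def has_proper_noun(node: Node) -> bool:
    return node.tag == "PROPN" or any(has_proper_noun(c) for c in node.children)


def deepest_pp(root: Node, keep_proper: bool) -> Optional[Tuple[Node, Node]]:
    """Return (parent, pp) for the deepest deletable prepositional phrase, if any."""
    best: Optional[Tuple[int, Node, Node]] = None
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in node.children:
            deletable = child.tag == "PP" and not (keep_proper and has_proper_noun(child))
            if deletable and (best is None or depth + 1 > best[0]):
                best = (depth + 1, node, child)
            stack.append((child, depth + 1))
    return None if best is None else (best[1], best[2])


def trim_pps(root: Node, threshold: int) -> Node:
    """Step 8 (spare PPs with proper nouns), then step 9 (delete any PP)."""
    for keep_proper in (True, False):
        while len(words(root)) > threshold:
            hit = deepest_pp(root, keep_proper)
            if hit is None:
                break
            parent, pp = hit
            parent.children = [c for c in parent.children if c is not pp]
    return root


# "prince of smallest country in world" (determiners already removed by step 1)
inner = Node("PP", children=[Node("P", "in"), Node("N", "world")])
country = Node("NP", children=[Node("ADJ", "smallest"), Node("N", "country"), inner])
outer = Node("PP", children=[Node("P", "of"), country])
phrase = Node("NP", children=[Node("N", "prince"), outer])
print(" ".join(words(trim_pps(phrase, threshold=4))))
```

With a threshold of four words, only the deepest phrase ("in world") is removed, matching the example given in step 8.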

SAMPLE  RUN 

> more test
<doc doc_id="UN-20.100-23" sys_id="SRC-SP">
<segment>
Este último misil puede equiparse con las ojivas nucleares que se están produciendo en Israel.
</segment>
</doc>

> depTrimmer+matador.pl test out TEST params=params.deptrimmer.1
parameter params = params.deptrimmer.1 ... loading
torotemp.*: No such file or directory
Processing Batch #0
PARSING...
TRANSLATING...
reading
/fs/clip-plus/habash/PACKAGE/GHMT-PAK/GHMT/TRANSLEX/Spanish-English/span-eng.tralex ... done
translating torotemp.dep ...
done
CONVERTING, EXPANDING...
<Running CorExerge>
; Exiting Lisp
<Running depTrimmer>
; Exiting Lisp
LINEARIZAING...
<Running OxyGen 2.0>
; Exiting Lisp
RANKING...
/fs/clip-plus/habash/PACKAGE/GHMT-PAK/GHMT/EXERGE/halogenize torotemp.out.gls
torotemp.out.txt 5 /fs/clip-plus/habash/PACKAGE/GHMT-PAK/GHMT/MATADOR/un500k.binlm
<GLS-to-Forest Conversion> && <Running HALOGEN>
/fs/cliplab/muri/habash/HALOGEN/ForestRanker/polishsen.pl /tmp/halolin-temp.senO
> /tmp/halolin-temp.sen
; cpu time (non-gc) 180 msec user, 0 msec system
; cpu time (gc) 50 msec user, 0 msec system
; cpu time (total) 230 msec user, 0 msec system
; real time 824 msec
; space allocation:
; 132,271 cons cells, 3,129,408 other bytes, 0 static bytes
; Exiting Lisp
REPORTING...
done!

> more out
<doc doc_id="UN-20.100-23" sys_id="TEST">
<segment>
Last missile can be equipped with nuclear warheads
</segment>
</doc>
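
The input and output files in this sample run share a simple doc/segment markup. Assuming such a file is well-formed XML with a single doc root (an assumption based only on the example above), its segments could be read with Python's standard library as in the sketch below. The function name read_segments is hypothetical and is not part of the depTrimmer package.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def read_segments(markup: str) -> List[Tuple[str, str, str]]:
    """Return (doc_id, sys_id, text) for every segment in one doc element."""
    doc = ET.fromstring(markup)
    doc_id = doc.get("doc_id", "")
    sys_id = doc.get("sys_id", "")
    # Strip the newlines that surround the text inside each segment element.
    return [(doc_id, sys_id, (seg.text or "").strip()) for seg in doc.iter("segment")]


sample = """<doc doc_id="UN-20.100-23" sys_id="TEST">
<segment>
Last missile can be equipped with nuclear warheads
</segment>
</doc>"""

for doc_id, sys_id, text in read_segments(sample):
    print(doc_id, sys_id, text)
```

A reader like this makes it straightforward to pair each source segment with its generated headline by doc_id when inspecting results.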